Reducing Confusion in Active Learning for Part-Of-Speech Tagging

Abstract

Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found the surprising result that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and an analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution. The code is publicly released.
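
The idea of targeting confusion between pairs of output tags can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given in the abstract. It is a simplified heuristic written under stated assumptions: each unlabeled token comes with a predicted tag distribution, the "confusion" of a tag pair is approximated by the summed probability of the runner-up tag, and the function names `most_confused_pair` and `select_tokens` are hypothetical.

```python
from collections import defaultdict


def top_two(probs):
    """Return the two most probable (tag, probability) entries for one token."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0], ranked[1]


def most_confused_pair(token_probs):
    """Find the tag pair with the largest accumulated runner-up probability.

    token_probs: list of dicts mapping tag -> predicted probability,
    one dict per unlabeled token (an assumed input format).
    """
    confusion = defaultdict(float)
    for probs in token_probs:
        (t1, _), (t2, p2) = top_two(probs)
        pair = tuple(sorted((t1, t2)))  # order-independent key for the pair
        confusion[pair] += p2           # assumed proxy for pairwise confusion
    best = max(confusion, key=confusion.get)
    return best, dict(confusion)


def select_tokens(token_probs, budget):
    """Pick up to `budget` token indices whose top two tags form the most confused pair."""
    pair, _ = most_confused_pair(token_probs)
    candidates = []
    for i, probs in enumerate(token_probs):
        (t1, _), (t2, p2) = top_two(probs)
        if tuple(sorted((t1, t2))) == pair:
            candidates.append((p2, i))
    candidates.sort(reverse=True)  # highest runner-up probability first
    return [i for _, i in candidates[:budget]]
```

Given tokens where NOUN and VERB are frequently the top two predictions, `select_tokens` would send the most ambiguous NOUN/VERB tokens to the annotator first, rather than the tokens with the highest overall entropy.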

Similar resources

Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation

In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of...

Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging

In this paper we describe an unsupervised learning algorithm for automatically training a rule-based part of speech tagger without using a manually tagged corpus. We compare this algorithm to the Baum-Welch algorithm, used for unsupervised training of stochastic taggers. Next, we show a method for combining unsupervised and supervised rule-based training algorithms to create a highly accurate t...

Learning Character-level Representations for Part-of-Speech Tagging

Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syntactic and semantic information about words. Information about word morphology and shape is normally ignored when learning word representations. However, for tasks like part-of-speech tagging, intra-word information is e...

Part-of-Speech Tagging of Transcribed Speech

We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The correction was first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators we reached results comparable to tho...

Iterative Part-of-Speech Tagging

Assigning a category to a given word (tagging) depends on the particular word and on the categories (tags) of neighboring words. A theory that is able to assign tags to a given text can naturally be viewed as a recursive logic program. This article describes how iterative induction, a technique that has been proven powerful in the synthesis of recursive logic programs, has been applied to the t...

Journal

Journal title: Transactions of the Association for Computational Linguistics

Year: 2021

ISSN: 2307-387X

DOI: https://doi.org/10.1162/tacl_a_00350